# 1. Install packages
install.packages(c("arrow", "curl", "vroom", "fs", "bench", "lobstr"))
# 2. Download Large data
large_data_url <- "https://usfedu-my.sharepoint.com/:u:/g/personal/gson_usf_edu/EUcFVZTsXA9DkxNCDa2SsZsBQQAjCmdv52GFDpVuRL7v-Q?download=1"
curl::multi_download(large_data_url, "option_prices.tsv", resume = TRUE)

Big Data Analysis in Finance
Module 2
Before we begin
Preparation
Make sure to do:
- Download data from url
- Install packages
The Arrow project
Apache Arrow
A cross-language, in-memory data format specification
A standardized way to represent data in memory
Designed to accelerate data processing and interoperability
Focused on efficient data transfer and sharing
Arrow
Accelerated Data Interchange
Apache Arrow & R
R is a powerful language for data analysis, and Apache Arrow addresses:
- Efficient (in-memory) data storage
- Fast data exchange between R and other languages
- Machine Learning Pipelines
- Cross-language Data Collaboration
Feather Project
Arrow:
- A unified, in-memory columnar data format
- Supports zero-copy sharing between processes
- Ideal for cross-language data processing (e.g., R, Python, C++)
Feather:
- Feather is built on Arrow, using its efficient data representation for disk storage
- Simplicity makes it perfect for smaller datasets or interim storage
Parquet:
- Designed for large-scale, analytical workloads
- Optimal for columnar storage with efficient compression
- Commonly used in big data ecosystems (e.g., Spark, Hadoop)
The Parquet Project
Intro to Parquet
Parquet is a disk-based, columnar storage format that adheres to the principles of Apache Arrow
While Feather emphasizes speed for quick data exchange, Parquet is designed for deep analytics on massive datasets.
Tabular data structure
Columnar vs. Row Storage
Columnar Storage
Data stored by column, not by row
Each column’s data is stored together
Row Storage
Data stored by row, with all columns together
Traditional relational databases often use row storage
Memory Buffer Structure
Columnar Storage
Optimized for analytics and querying
Excellent compression and encoding capabilities
Efficient for analytic querying
Row Storage
Suitable for transactional systems
Efficient for read/write operations by row
Less efficient for analytic queries
Workflow
The arrow package for R has a low-level interface to C++ and also offers a dplyr backend:
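A minimal sketch of that dplyr-backend workflow, using a throwaway Parquet file so the example is self-contained (the column names here are made up for illustration):

```r
library(arrow)
library(dplyr)

# Create a tiny Parquet file so the sketch is self-contained
tmp <- tempfile(fileext = ".parquet")
write_parquet(data.frame(ticker = c("A", "A", "B"), price = c(1, 3, 5)), tmp)

ds <- open_dataset(tmp)            # lazy: nothing is read into memory yet
res <- ds |>
  group_by(ticker) |>
  summarise(mean_price = mean(price)) |>
  collect()                        # the computation happens only here
res
```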
Lazy vs Eager Evaluations
Lazy evaluations
Lazy evaluation postpones the computation of an expression until its result is explicitly requested
Saves memory by delaying computations
Useful for large data operations
Explicit call:
arrow::collect() or dplyr::collect()
Eager evaluations
Eager evaluation computes immediately when the expression is given
Faster when the result is needed
Real-time data analysis
Interactive programming
Arrow lazy evaluations
arrow::open_dataset() reads the dataset “lazily”
It only creates a link to the data on disk
Does NOT read the data until explicitly told to
collect() should be called to read the data
c.f.) arrow::read_csv_arrow() or readr::read_csv() reads the file right away (eager)
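A small sketch contrasting the two readers on a throwaway CSV file:

```r
library(arrow)

tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:5), tmp, row.names = FALSE)

eager <- read_csv_arrow(tmp)       # data is in memory immediately
lazy  <- open_csv_dataset(tmp)     # only a link to the file on disk

nrow(eager)                        # 5 rows already loaded
df <- dplyr::collect(lazy)         # the actual read happens here
nrow(df)
```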
Benchmarking
What is Code Benchmarking?
There are many different ways to achieve the same goal.
Code benchmarking is the process of:
Measuring and analyzing the performance of your R code
Identifying the most efficient among many alternatives
Why does it matter?
Because the data is big.
The difference between an operation taking 3 s versus 1 s is probably not significant.
However, run time usually scales with the size of the data:
- It matters when it comes to 3 hours vs 1 hour
- or even 3 days vs 1 day
Code Benchmarking
The process of measuring the performance of your code, to assess:
- execution time
- resource usage
We will use the bench package. Make sure it is installed:
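A minimal bench::mark() illustration on a made-up task (check = FALSE because floating-point sums can differ in the last bits between the two implementations):

```r
library(bench)

x <- runif(1e5)
res <- bench::mark(
  loop       = { s <- 0; for (v in x) s <- s + v; s },
  vectorised = sum(x),
  check = FALSE   # floating-point results may differ slightly
)
res[, c("expression", "median", "mem_alloc")]
```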
Dummy example
Generate a dummy large data file by repeating and row-binding.
Then save it to a .csv file using one of:
- utils::write.csv()
- readr::write_csv()
- arrow::write_csv_arrow()
- vroom::vroom_write()
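The four writers can be compared with bench::mark() on a small dummy data frame; a sketch, assuming all four packages are installed:

```r
library(bench)

df <- data.frame(x = rnorm(1e4), y = rnorm(1e4))
res <- bench::mark(
  utils = utils::write.csv(df, tempfile(fileext = ".csv"), row.names = FALSE),
  readr = readr::write_csv(df, tempfile(fileext = ".csv")),
  arrow = arrow::write_csv_arrow(df, tempfile(fileext = ".csv")),
  vroom = vroom::vroom_write(df, tempfile(fileext = ".tsv")),
  check = FALSE   # the writers return different (invisible) values
)
res[, c("expression", "median")]
```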
Dummy example: Writing speed
Efficient storage
gz compressions
Compression can reduce the size of the data. Gzip CSV compression with vroom:
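A sketch: vroom infers gzip compression from the .gz file extension:

```r
library(vroom)

df    <- data.frame(x = rnorm(1e4))
plain <- tempfile(fileext = ".csv")
gz    <- tempfile(fileext = ".csv.gz")

vroom_write(df, plain, delim = ",")
vroom_write(df, gz, delim = ",")   # compression inferred from .gz

fs::file_size(c(plain, gz))        # the .gz file should be much smaller
```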
parquet
An Arrow storage solution. Parquet has the benefits of both worlds:
- Faster writing and reading
- Faster processing with Arrow memory representation
- Smaller file size with snappy compression (by default)
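A sketch comparing Parquet (snappy-compressed by default) against plain CSV on dummy data:

```r
library(arrow)

df  <- data.frame(x = rnorm(1e5), g = sample(letters, 1e5, replace = TRUE))
csv <- tempfile(fileext = ".csv")
pq  <- tempfile(fileext = ".parquet")

write_csv_arrow(df, csv)
write_parquet(df, pq)              # snappy compression by default

fs::file_size(c(csv, pq))          # Parquet is typically much smaller
```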
File size comparison:
Reading speed benchmarks:
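A reading-speed sketch on the same dummy data, using bench::mark() (check = FALSE because the two readers return different container types):

```r
library(bench)

df  <- data.frame(x = rnorm(1e5))
csv <- tempfile(fileext = ".csv")
pq  <- tempfile(fileext = ".parquet")
arrow::write_csv_arrow(df, csv)
arrow::write_parquet(df, pq)

res <- bench::mark(
  csv     = arrow::read_csv_arrow(csv),
  parquet = arrow::read_parquet(pq),
  check = FALSE
)
res[, c("expression", "median")]
```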
Verdict
Use Parquet when your data is large: it is a lighter, faster storage solution.
In-Class Exercise
Exercise
Let’s work on real financial data:
All U.S. Equity options
- Each stock has many option series
- In tsv format, 1 GB+ (random sampling)
- End-of-day records
Read lazily
arrow::open_tsv_dataset() opens data lazily, like setting up a database connection.
arrow package provides two types of readers: eager and lazy.
read_csv_arrow() is the eager reader, while open_csv_dataset() is the lazy reader.
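A sketch of the lazy tsv reader; a tiny stand-in file is generated here so the example runs without the full option_prices.tsv download (the strike column is hypothetical):

```r
library(arrow)
library(dplyr)

# Self-contained stand-in for option_prices.tsv
tmp <- tempfile(fileext = ".tsv")
write.table(data.frame(strike = c(100, 105, 110)), tmp,
            sep = "\t", row.names = FALSE, quote = FALSE)

# Lazy: a link to the file, like a database connection
opts <- open_tsv_dataset(tmp)
res <- opts |> head(2) |> collect()   # only these rows are materialised
res
```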
Lazy evaluation
Let’s check memory footprint of the data
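One way to see the footprint, sketched with lobstr: a lazily opened dataset is only a handle, so its reported size stays tiny regardless of the file size.

```r
library(arrow)
library(lobstr)

tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = rnorm(1e5)), tmp, row.names = FALSE)

eager <- read_csv_arrow(tmp)       # full data in memory
lazy  <- open_csv_dataset(tmp)     # just a handle to the file

lobstr::obj_size(eager)            # roughly the size of the data
lobstr::obj_size(lazy)             # a small handle, not the data
```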
Out-of-core processing
When the data doesn’t fit in your memory, you can still perform the analysis with arrow.
Perform below operation:
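The exact operation was given in class; as an illustrative stand-in, here is a filter-and-summarise pipeline that arrow executes out of core, materialising only the small result (the column names are hypothetical, and a tiny stand-in file replaces the real data):

```r
library(arrow)
library(dplyr)

# Stand-in data file; in class this would be option_prices.tsv
tmp <- tempfile(fileext = ".tsv")
write.table(data.frame(cp_flag = c("C", "P", "C"), volume = c(10, 0, 5)),
            tmp, sep = "\t", row.names = FALSE, quote = FALSE)

res <- open_tsv_dataset(tmp) |>
  filter(volume > 0) |>            # evaluated lazily, out of core
  group_by(cp_flag) |>
  summarise(total_volume = sum(volume)) |>
  collect()                        # only the small result enters memory
res
```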
Write subsample
Store them in a separate file: csv
Or store them in another file: parquet
Report the size of the files with R code.
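A sketch of writing the same subsample both ways and reporting the sizes with fs (the subsample contents and file names are placeholders):

```r
library(arrow)
library(fs)

subsample <- data.frame(x = rnorm(1e4))   # placeholder subsample

csv <- tempfile("subsample", fileext = ".csv")
pq  <- tempfile("subsample", fileext = ".parquet")

write_csv_arrow(subsample, csv)
write_parquet(subsample, pq)

fs::file_size(c(csv, pq))                  # report both sizes
```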
Benchmarking
Use the dummy code provided below to benchmark reading the january_data file.
A Note on Parallel processing
CPU architecture
Most jobs are done by a single core
Parallel processing
Parallel processing is a technique that utilizes multiple cores instead of a single core.
Among others, R’s future package handles parallel processing.
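A minimal future sketch: the expression runs in a background R session while the main session stays free.

```r
library(future)

plan(multisession, workers = 2)   # spin up 2 background R sessions

f <- future(sum(runif(1e6)))      # starts running in the background
v <- value(f)                     # blocks until the result is ready
v

plan(sequential)                  # return to single-core evaluation
```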
Why Not All Cores by Default?
Imagine you’re preparing a simple sandwich. If you invite three friends to help, you each have to coordinate who does what—getting separate cutting boards, passing around ingredients, double-checking steps. The extra coordination might take longer than just making the sandwich yourself!
Overhead is the cost
In computing terms, splitting a job across multiple cores similarly involves overhead:
- Setting up processes
- Transferring data
- Merging results
If the task is small or simple, the overhead can outweigh the benefit of using more cores.
Introduction to HPC
HPC at USF
CIRCE
Managed by Research Computing department
Two main clusters: CIRCE and Secure Cluster for sensitive data
Access through JIRA or email request
Getting Started: login
You can log in using a terminal. For example,
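A command sketch (the hostname below is a placeholder; use the address given by Research Computing, and your own NetID):

```shell
# Replace NETID and the hostname with the values from Research Computing
ssh NETID@circe.rc.usf.edu
```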
After login
After login, you’ll be on the login node. Suppose you want to request compute resources from the server and get an interactive prompt (bash).
Interactive session
- 1 node, 1 task
- 1-hour time limit
- Launches a bash shell
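With SLURM, an interactive session matching the settings above can be requested like this (a sketch; exact flags and defaults vary by cluster configuration):

```shell
# 1 node, 1 task, 1-hour limit, interactive bash shell
srun --nodes=1 --ntasks=1 --time=01:00:00 --pty /bin/bash
```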
Sending Jobs
Usually jobs on the server are run with batch scripts.
For instance, write a shell script called my_job.sh such as:
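A sketch of such a script (the module name and the R script it runs are placeholders; check `module avail` on the cluster for the actual R module):

```shell
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --output=my_job.%j.out

# Load R (module name is cluster-specific) and run the analysis script
module load apps/R
Rscript my_analysis.R
```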
Then submit the job with:
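With SLURM, submission is:

```shell
sbatch my_job.sh
```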
You can monitor your job with:
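For example, listing only your own jobs:

```shell
squeue -u $USER    # shows your queued and running jobs
```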
Suggested Reading
- Apache Arrow in R (https://arrow-user2022.netlify.app/hello-arrow.html)
- Hadley Wickham et al., “R for Data Science”, 2nd ed., Ch. 23: Arrow